
While the clinical characteristics and hospital courses of patients with COVID-19 have been previously described, particularly in adults, the risk factors associated with hospitalization in pediatric patients has been understudied 1. Fortunately, COVID-19 seems to affect children generally less severely than adults 2. However, some children do become ill enough to require hospitalization. Moreover, as the number of cases continues to rise across the country, COVID-19 cases in pediatric patients also continues to increase 3.
Here, we present our analyses of a dataset of 375 pediatric patients who returned a positive SARS-CoV-2 RT PCR test at a single tertiary care medical center in New York City from the beginning of the COVID-19 pandemic until October 1, 2020.
The dataset contains de-identified patient health information consisting of age (from 0 to up to 23 years), demographic data (including biological sex, zip code of residence, and a socioeconomic status variable), whether there was a preceding visit to the Emergency Department, any past medical history of asthma, obesity, and/or diabetes, and whether there was an admission to the intensive care unit.
The goal of this project was to perform an exploratory analysis of this dataset in order to begin to assess the risk factors associated with pediatric hospitalization for COVID-19 infection. Prior studies have shown that clinical features including low oxygen saturation and abnormal chest imaging are predictive of the need for hospitalization. Our focus in this present study was more on demographic characteristics and pre-existing comorbidities. We hypothesize that increasing age, a past medical history of asthma, obesity, and diabetes are positively associated with hospitalization.
Prior studies in children have focused on the clinical characteristics after hospitalization, as well as hospital course 4, 5. In particular, some children have required admission to the intensive care unit and some have developed a multi-system inflammatory syndrome 6. Fortunately, most children in the community seem to fare well with few developing critical illness.
First we load our necessary packages and set ggplot preferences.
library(tidyverse)
library(usa)
library(mice)
knitr::opts_chunk$set(out.width = "90%")
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis",
tigris_use_cache = TRUE)
scale_colour_discrete = scale_color_viridis_d
scale_fill_discrete = scale_fill_viridis_d
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, echo = TRUE)
Now, we will load the data and take a look.
# Load the data
ped_covid =
read_csv("./data/p8105_final_ped_covid.csv")
This is a de-identified dataset of pediatric patients from a tertiary care medical center in New York City who tested positive for COVID-19 via the SARS-CoV-2 RT PCR test. The age range in the study is 0 to 22 years of age. An id number for each patient has been randomly generated. There are 375 rows (patients) and 30 columns in this dataset. The variables included in this dataset are id, admitted, age, date_of_birth, ethnicity, gender, censusblock, censusblockgroup, censustract, city, race, ses, state, zip_code_set, eventdatetime, outcomeadmission_admission_1inpatient_admit_service, bmi_yes_or_no, bmi_event_date_time, bmi_value, asthma_date_time, asthma_dx, diabetes_date_time, diabetes_dx, icu_yes_no, icu_date_time, systolic_bp_event_date_time, systolic_bp_value, ed_yes_no_0_365_before, admission_primary_dx, admission_apr_drg. Some of the important variables are date and time of positive covid test (“eventdatetime”), whether the patient was admitted (“admitted”), whether there was a preceding emergency department visit (“ed_yes_no_0_365_before”), whether the patient needed intensive care admission (“icu_yes_no”) and date and time of icu admission (“icu_date_time”), demographic data (age, gender, ethnicity, race, zip code data - predominantly in the Bronx), some past medical history data (bmi data, asthma data, diabetes data) and one vital sign datum (systolic blood pressure).
The dataset is not tidy for a number of reasons including but not limited to:
ped_covid =
read_csv("./data/p8105_final_ped_covid.csv") %>%
mutate(
ethnicity_race = case_when(
race == "R3 Black or African-American" ~ "black",
race == "R2 Asian" ~ "asian",
race == "R5 White" ~ "caucasian",
race == "R1 American Indian or Alaska Native" ~ "american indian",
race == "Multiple Selected" ~ "multiple",
ethnicity == "E1 Spanish/Hispanic/Latino" ~ "latino"
)
) %>%
mutate(
asthma = replace_na(asthma_dx, 0),
asthma = str_replace(asthma, ".*J.*", "1"),
diabetes = replace_na(diabetes_dx, 0),
diabetes = str_replace(diabetes, ".*E.*", "1"),
zip = as.character(zip_code_set),
service = outcomeadmission_admission_1inpatient_admit_service,
ed = ed_yes_no_0_365_before,
admission_dx = admission_apr_drg,
icu = icu_yes_no
) %>%
mutate(obesity = case_when(
bmi_value >= 30 ~ "1",
bmi_value < 30 ~"0"
))
We noted that the dataset contained information about zipcode but not latitude and longitude. To add this information we merged zipcodes with a file in the usa package containing information about latitude and longitude for each zipcode.
# Merge zipcode data with latitude and longitude.
zipcode_df =
usa::zipcodes
ped_covid =
left_join(ped_covid, zipcode_df, by = "zip") %>%
select(admitted, age, gender, ses, zip, eventdatetime, bmi_value, icu, icu_date_time,
systolic_bp_value, ethnicity_race, asthma, diabetes, zip, service, ed, admission_dx,
city.y, obesity, lat, long) %>%
mutate_at(c("admitted", "icu", "ethnicity_race", "asthma", "diabetes",
"ed", "city.y", "obesity"), as.factor) %>%
mutate(
gender = factor(gender, levels = c("F", "M")),
ethnicity_race = factor(ethnicity_race, levels = c("caucasian", "black", "latino", "asian", "american indian", "multiple", "NA"))
) %>%
rename(city = city.y)
An additional issue in the data is that it contains significant amounts of missing data. This is less of an issue for visualizations and logistic regression but will become an issue for model building with machine learning within the caret package. We thus used the mice package in R to impute missing data for later use in model building. This package uses sequential regression multiple imputation, which is advantageous over single imputations in that it accounts for the statistical uncertainty in the imputation process.
# Impute missing data for use in model_building
impute = mice(ped_covid, m=3, seed=111)
datacomplete = complete(impute,2)
# Export csv
write_csv(datacomplete, "./data/datacomplete.csv")
To perform exploratory analyses, we loaded some additional libraries.
library(patchwork)
library(corrplot)
library(mgcv)
library(modelr)
Our goal here is to develop a causation-type model instead of a prediction-targeted one. This is different than the interaction model tool, which was developed for predictive purposes. We will analyze correlations between admittance rate and multiple covariables and then develop a few select models to compare based on apriori knowledge and p-values. Another interesting variable that we would have liked to analyze was admittance to an ICU, but, as can be seen in the data summary, there were too few events of ICU admittance to draw any meaningful conclusions.
library(tidyverse)
library(patchwork)
library(corrplot)
library(mgcv)
library(modelr)
knitr::opts_chunk$set(
fig.width = 6,
fig.asp = 0.65,
out.width = "100%",
out.height = "75%")
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis")
scale_colour_discrete = scale_color_viridis_d
scale_fill_discrete = scale_fill_viridis_d
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, echo = TRUE)
ped_covid =
read_csv("./data/p8105_final_ped_covid.csv") %>%
mutate(
ethnicity_race = case_when(
race == "R3 Black or African-American" ~ "black",
race == "R2 Asian" ~ "asian",
race == "R5 White" ~ "caucasian",
race == "R1 American Indian or Alaska Native" ~ "american indian",
race == "Multiple Selected" ~ "multiple",
ethnicity == "E1 Spanish/Hispanic/Latino" ~ "latino"
)
) %>%
mutate(
asthma = replace_na(asthma_dx, 0),
asthma = str_replace(asthma, ".*J.*", "1"),
diabetes = replace_na(diabetes_dx, 0),
diabetes = str_replace(diabetes, ".*E.*", "1"),
zip = as.character(zip_code_set),
service = outcomeadmission_admission_1inpatient_admit_service,
ed = ed_yes_no_0_365_before,
admission_dx = admission_apr_drg,
icu = icu_yes_no
) %>%
mutate(obesity = case_when(
bmi_value >= 30 ~ "1",
bmi_value < 30 ~"0"
))
zipcode_df =
usa::zipcodes
ped_covid =
left_join(ped_covid, zipcode_df, by = "zip") %>%
select(admitted, age, gender, ses, zip, eventdatetime, bmi_value, icu, icu_date_time,
systolic_bp_value, ethnicity_race, asthma, diabetes, zip, service, ed, admission_dx,
city.y, obesity, lat, long) %>%
mutate_at(c("admitted", "icu", "ethnicity_race", "asthma", "diabetes",
"ed", "city.y", "obesity"), as.factor) %>%
mutate(
gender = factor(gender, levels = c("F", "M")),
ethnicity_race = factor(ethnicity_race, levels = c("caucasian", "black", "latino", "asian"))
) %>%
rename(city = city.y)
summary(ped_covid)
admitted age gender ses zip
no :250 Min. : 0.00 F :189 Min. :-13.506 Length:375
yes:125 1st Qu.:10.00 M :184 1st Qu.: -6.921 Class :character
Median :18.00 NA's: 2 Median : -4.122 Mode :character
Mean :14.63 Mean : -4.308
3rd Qu.:21.00 3rd Qu.: -2.112
Max. :22.00 Max. : 2.931
NA's :141
eventdatetime bmi_value icu icu_date_time
Length:375 Min. :12.03 0:355 Length:375
Class :character 1st Qu.:18.73 1: 20 Class :character
Mode :character Median :23.20 Mode :character
Mean :25.53
3rd Qu.:30.73
Max. :80.84
NA's :116
systolic_bp_value ethnicity_race asthma diabetes service ed
Min. : 61 caucasian: 18 0:314 0:358 Length:375 0:171
1st Qu.:103 black : 75 1: 61 1: 17 Class :character 1:204
Median :117 latino :189 Mode :character
Mean :116 asian : 8
3rd Qu.:128 NA's : 85
Max. :182
NA's :240
admission_dx city obesity lat
Length:375 Bronx :317 0 :187 Min. :40.58
Class :character Yonkers : 11 1 : 72 1st Qu.:40.83
Mode :character Mount Vernon: 7 NA's:116 Median :40.85
Brooklyn : 6 Mean :40.86
New York : 5 3rd Qu.:40.87
(Other) : 24 Max. :41.52
NA's : 5 NA's :5
long
Min. :-74.20
1st Qu.:-73.90
Median :-73.88
Mean :-73.88
3rd Qu.:-73.86
Max. :-73.67
NA's :5
This is a dataset of 375 pediatric patients 0 to 23 years of age with COVID-19 infection. First, we explore the data by generating various plots to inform our model building.
There appears to be a bimodal distribution of hospital admission as a function of age. Among infants and toddlers less than 5 years of age who test positive for COVID-19, more are admitted than not admitted. Note that this could be due to babies being admitted along with their mothers’ who had COVID-19. After 16 years of age, hospitalizations for COVID-19 infection appear to be less than non-hospitalizations.
admitt_p =
ped_covid %>%
ggplot(aes(x = age, fill = admitted)) +
geom_density(alpha = .6) +
labs(
title = "Admittance/Age Distribution",
x = "Age (Years)",
y = "Density") +
theme(legend.position = "bottom")
admitt_p

The distribution of diabetes, and asthma diagnoses in pediatric patients with COVID-19 infection by age are shown below. We can see that asthma and diabetes are more common above 10 years of age.
diabetes_p =
ped_covid %>%
ggplot(aes(x = age, fill = diabetes)) +
geom_density(alpha = .6) +
labs(
title = "Diabetes/Age Distribution",
x = "Age (Years)",
y = "Density") +
theme(legend.position = "bottom")
diabetes_p

asthma_p =
ped_covid %>%
ggplot(aes(x = age, fill = asthma)) +
geom_density(alpha = .6) +
labs(
title = "Asthma/Age Distribution",
x = "Age (Years)",
y = "Density") +
theme(legend.position = "bottom")
asthma_p

Below, we explore first systolic blood pressure, BMI, and socioeconomic status (SES) by admission status. The median first systolic pressure is higher among admitted patients compared to non-admitted patients. BMI across admission status appears to be similar, with some high BMI outliers in the non-hospitalized group. Median SES is higher among admitted patients. The SES variable was defined by the hospital using multiple economic and educational parameters, with negative values indicating below average SES and positive values indicating above average SES.
Systolic blood pressure had 240 missing values out of 375 so we decided to not include it in the models.
bp_p =
ped_covid %>%
ggplot(aes(x = admitted, y = systolic_bp_value)) +
geom_boxplot() +
labs(
title = "Systolic Blood Pressure by Admission Status",
x = "Hospital Admission Status",
y = "Systolic Blood Pressure")
bp_p

bmi_p =
ped_covid %>%
ggplot(aes(x = admitted, y = bmi_value)) +
geom_boxplot() +
labs(
title = "BMI by Admission Status",
x = "Hospital Admission Status",
y = "BMI Value")
bmi_p

ses_p =
ped_covid %>%
ggplot(aes(x = admitted, y = ses)) +
geom_boxplot() +
labs(
title = "Socioeconomic Status by Admission Status",
x = "Hospital Admission Status",
y = "SES Measure")
ses_p

We can see from the correlation matrix that there is some positive correlation between age and BMI value (0.42), between BMI value and Asthma (0.24) and also between BMI value and diabetes (0.29).
Note that we had categories for American Indian and Multiple/Mixed Race but there were too few counts for correlation significance purposes and were omitted for simplicity.
cor_data =
cor(model.matrix(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses, ped_covid)[,-1])
cor_data %>%
corrplot(method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 65, insig = "blank" , number.cex = 0.7, diag = FALSE)

When developing some of the models, we started by considering only categorical variables for simplicity along with the continuous variable of age. The ethnicity/race variable was the only non-binary categorical variables which we tried removing as you can see in the two model tables below. However, at the 5% level of significance, age and diabetes were still the only significant variables related to hospitalization.
race_mod =
glm(
admitted ~ age + gender + ethnicity_race + asthma + diabetes + obesity,
data = ped_covid,
family = binomial()
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
race_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 5.101 | 1.134 | 22.939 | 0.034 |
| age | 0.938 | 0.897 | 0.981 | 0.005 |
| genderM | 0.728 | 0.409 | 1.295 | 0.280 |
| ethnicity_raceblack | 0.336 | 0.086 | 1.318 | 0.118 |
| ethnicity_racelatino | 0.344 | 0.092 | 1.286 | 0.113 |
| ethnicity_raceasian | 0.323 | 0.044 | 2.371 | 0.266 |
| asthma1 | 0.931 | 0.468 | 1.856 | 0.840 |
| diabetes1 | 3.764 | 1.100 | 12.880 | 0.035 |
| obesity1 | 2.166 | 1.070 | 4.384 | 0.032 |
no_race_mod =
glm(
admitted ~ age + gender + asthma + diabetes + obesity,
data = ped_covid,
family = binomial()
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
no_race_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 1.570 | 0.857 | 2.876 | 0.144 |
| age | 0.945 | 0.910 | 0.980 | 0.003 |
| genderM | 0.865 | 0.519 | 1.444 | 0.580 |
| asthma1 | 0.864 | 0.456 | 1.639 | 0.655 |
| diabetes1 | 5.119 | 1.574 | 16.651 | 0.007 |
| obesity1 | 1.812 | 0.964 | 3.403 | 0.065 |
We tried to include the continuous variables of BMI value and SES. Diabetes and BMI value were significant at the 5% level of significance, but age was insignificant.
complex_mod =
glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses,
data = ped_covid,
family = binomial(link = "logit")
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
complex_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 1.140 | 0.149 | 8.723 | 0.900 |
| age | 0.930 | 0.862 | 1.004 | 0.063 |
| genderM | 0.534 | 0.261 | 1.095 | 0.087 |
| ethnicity_raceblack | 0.549 | 0.110 | 2.733 | 0.464 |
| ethnicity_racelatino | 0.441 | 0.092 | 2.113 | 0.306 |
| ethnicity_raceasian | 0.246 | 0.017 | 3.623 | 0.307 |
| asthma1 | 0.959 | 0.439 | 2.095 | 0.916 |
| diabetes1 | 5.286 | 1.257 | 22.233 | 0.023 |
| bmi_value | 1.054 | 1.009 | 1.100 | 0.017 |
| ses | 0.990 | 0.873 | 1.123 | 0.876 |
We postulated that the reason age became insignificant was that we did not consider its interactions with other variables. Adding interaction terms between age and BMI as well as age and asthma we found age and diabetes to be the only significant predictors at a 5% level of significance.
age_int_mod =
glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses + age*bmi_value + age*asthma,
data = ped_covid,
family = binomial(link = "logit")
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
age_int_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 15.954 | 0.329 | 774.203 | 0.162 |
| age | 0.800 | 0.654 | 0.979 | 0.030 |
| genderM | 0.564 | 0.271 | 1.170 | 0.124 |
| ethnicity_raceblack | 0.510 | 0.102 | 2.562 | 0.414 |
| ethnicity_racelatino | 0.425 | 0.087 | 2.065 | 0.289 |
| ethnicity_raceasian | 0.227 | 0.014 | 3.585 | 0.292 |
| asthma1 | 0.604 | 0.046 | 7.900 | 0.701 |
| diabetes1 | 5.708 | 1.347 | 24.188 | 0.018 |
| bmi_value | 0.931 | 0.787 | 1.101 | 0.405 |
| ses | 0.996 | 0.877 | 1.132 | 0.952 |
| age:bmi_value | 1.007 | 0.998 | 1.016 | 0.139 |
| age:asthma1 | 1.028 | 0.887 | 1.192 | 0.712 |
This model was fitted to further illustrate the importance of age interaction terms to making age a significant predictor of hospitalizations due to COVID-19 in those under 23 years of age.
other_int_mod =
glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses + ses*bmi_value,
data = ped_covid,
family = binomial(link = "logit")
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
other_int_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 1.154 | 0.065 | 20.525 | 0.922 |
| age | 0.930 | 0.862 | 1.004 | 0.064 |
| genderM | 0.535 | 0.261 | 1.095 | 0.087 |
| ethnicity_raceblack | 0.548 | 0.109 | 2.752 | 0.465 |
| ethnicity_racelatino | 0.440 | 0.092 | 2.116 | 0.306 |
| ethnicity_raceasian | 0.245 | 0.016 | 3.785 | 0.314 |
| asthma1 | 0.959 | 0.438 | 2.097 | 0.916 |
| diabetes1 | 5.292 | 1.244 | 22.503 | 0.024 |
| bmi_value | 1.053 | 0.966 | 1.149 | 0.242 |
| ses | 0.992 | 0.662 | 1.488 | 0.970 |
| bmi_value:ses | 1.000 | 0.986 | 1.014 | 0.990 |
We eventually settled on a complex model with all theoretically and statistically plausible interactions, including that between SES and BMI, age and asthma, as well as age and BMI. We can see that age is a significant predictor of hospitalization as well as having diabetes. It is important to note how the 95% confidence intervals for diabetes were very wide, and that is probably due to the fact that only 17 out 375 people in our dataset had diabetes. We kept diabetes in the models regardless because of apriori assumptions that diabetes exacerbates effects of the COVID-19 infection.
int_mod =
glm(admitted ~ age + gender + asthma + ethnicity_race + diabetes + bmi_value + ses + ses*bmi_value + age*asthma + age*bmi_value,
data = ped_covid,
family = binomial(link = "logit")
) %>%
broom::tidy() %>%
mutate(
OR = exp(estimate),
CI_lower = exp(estimate - 1.96 * std.error),
CI_upper = exp(estimate + 1.96 * std.error)
) %>%
select(term, OR, starts_with("CI"), p.value) %>%
knitr::kable(digits = 3)
int_mod
| term | OR | CI_lower | CI_upper | p.value |
|---|---|---|---|---|
| (Intercept) | 15.456 | 0.195 | 1222.505 | 0.220 |
| age | 0.800 | 0.654 | 0.979 | 0.030 |
| genderM | 0.563 | 0.271 | 1.170 | 0.124 |
| asthma1 | 0.604 | 0.046 | 7.901 | 0.701 |
| ethnicity_raceblack | 0.511 | 0.101 | 2.593 | 0.418 |
| ethnicity_racelatino | 0.425 | 0.087 | 2.072 | 0.290 |
| ethnicity_raceasian | 0.229 | 0.014 | 3.790 | 0.303 |
| diabetes1 | 5.692 | 1.329 | 24.373 | 0.019 |
| bmi_value | 0.932 | 0.776 | 1.120 | 0.453 |
| ses | 0.990 | 0.656 | 1.495 | 0.962 |
| bmi_value:ses | 1.000 | 0.986 | 1.014 | 0.975 |
| age:asthma1 | 1.028 | 0.887 | 1.192 | 0.712 |
| age:bmi_value | 1.007 | 0.998 | 1.016 | 0.139 |
Next, we created an interactive dashboard to further explore and visualize the data.
First we loaded additional necessary packages.
library(plotly)
library(leaflet)
library(tidycensus)
library(tmap)
To create a map, we categorized the cities into their respective counties and used the census data from the tidycensus package.
covid_df =
read_csv("./data/p8105_final_ped_covid.csv") %>%
janitor::clean_names() %>%
mutate(county = NA,
county = as.character(county)) %>%
select(city, county, everything()) %>%
mutate(city = tolower(city)) %>%
mutate(city = str_replace(city, "n white plains", "white plains")) %>%
mutate(county = case_when(city == "bronx" ~ "bronx",
city == "brooklyn" ~ "kings",
city == "yonkers" ~ "westchester",
city == "new york" ~ "new york",
city == "mount vernon" ~ "westchester",
city == "new rochelle" ~ "westchester",
city == "white plains" ~ "westchester",
city == "ridgewood" ~ "queens",
city == "nanuet" ~ "rockland",
city == "bergenfield" ~ "bergen",
city == "ossining" ~ "westchester",
city == "monroe" ~ "orange",
city == "newburgh" ~ "orange",
city == "staten island" ~ "richmond",
city == "port chester" ~ "westchester",
city == "spring valley" ~ "rockland",
city == "irvington" ~ "westchester",
city == "flushing" ~ "queens",
city == "chappaqua" ~ "westchester",
city == "new city" ~ "rockland",
city == "ferncliff manor" ~ "westchester",
city == "greenwich" ~ "washington",
city == "haverstraw" ~ "rockland",
city == "suffern" ~ "rockland",
city == "berkeley heights" ~ "union")
) %>%
mutate(eventdatetime = as.Date(eventdatetime, "%m/%d/%Y"),
eventdatetime = format(eventdatetime, "%m-%Y"),
eventdatetime = zoo::as.yearmon(eventdatetime, "%m-%Y")) %>%
mutate(
ethnicity_race = case_when(
race == "R3 Black or African-American" ~ "black",
race == "R2 Asian" ~ "asian",
race == "R5 White" ~ "caucasian",
race == "R1 American Indian or Alaska Native" ~ "american indian",
race == "Multiple Selected" ~ "multiple",
ethnicity == "E1 Spanish/Hispanic/Latino" ~ "latino"
))
Here, we have generated an interactive map that demonstrates average SES and average BMI across various parts of New York and New Jersey, reflecting the residences of pediatric patients with COVID-19 infection. By selecting from the layered box on the top left corner, users can customize different base layers for the map, as well as average SES or average BMI.
# Map Viz using tidycensus and tmap
county_ny = c("bronx", "kings", "westchester", "new york", "queens", "rockland", "orange", "richmond")
county_nj = c("bergen", "union")
shape_ny =
get_acs(geography = "tract",
variables = "B19013_001",
state = "NY",
county = county_ny,
geometry = TRUE) %>%
janitor::clean_names() %>%
select(name, geometry) %>%
separate(name, into = c("county", "state"), sep = -17) %>%
mutate(state = str_sub(state, 10),
county = tolower(county),
county = sub(".*\\s", "", trimws(county)),
county = str_replace(county, "york", "new york"))
shape_nj =
get_acs(geography = "tract",
variables = "B19013_001",
state = "NJ",
county = county_nj,
geometry = TRUE) %>%
janitor::clean_names() %>%
select(name, geometry) %>%
separate(name, into = c("county", "state"), sep = -18) %>%
mutate(state = str_sub(state, 9),
county = tolower(county),
county = sub(".*\\s", "", trimws(county)))
shape_full =
rbind(shape_ny, shape_nj)
bmi_mean =
covid_df %>%
group_by(county) %>%
summarize(bmi_mean = mean(bmi_value, na.rm = TRUE))
ses_mean =
covid_df %>%
group_by(county) %>%
summarize(ses_mean = mean(ses, na.rm = TRUE))
full_map_bmi =
left_join(shape_full, bmi_mean, by = "county")
full_map_ses =
left_join(shape_full, ses_mean, by = "county")
tmap_mode("view")
tm_full =
tm_shape(full_map_bmi) +
tm_fill(
n = 4,
group = "Average BMI",
midpoint = NA,
col = "bmi_mean",
palette = "viridis",
style = "quantile",
contrast = c(0.3, 1),
title = "Average BMI",
textNA = "Not Available",
id = "state",
popup.vars=c("County: " = "county",
"Average BMI: " = "bmi_mean")) +
tm_borders(col = "white") +
tm_shape(full_map_ses) +
tm_fill(
n = 4,
group = "Average SES",
midpoint = NA,
col = "ses_mean",
palette = "RdYlBu",
style = "quantile",
contrast = c(0.3, 1),
title = "Average SES",
textNA = "Not Available",
id = "state",
popup.vars=c("County: " = "county",
"Average SES: " = "ses_mean")) +
tm_borders(col = "white") +
tm_view(
alpha = 0.85,
view.legend.position = c("right", "bottom"),
colorNA = NULL) +
tm_scale_bar(text.size = 1) +
tm_facets(as.layers = TRUE, sync = TRUE)
tm_full %>%
tmap_leaflet() %>%
leaflet::hideGroup("Average SES")
The number of pediatric patients testing positive for COVID-19 is shown as a function of time, by admission status (yes/no). The number of patients who returned positive SARS-CoV-2 RT-PCR tests peaked in late February and decreased until April. From April to June, levels of positive tests generally plateaued for both admitted and non-admitted patients until June, at which point levels dropped off to close to zero for both groups.
plot_1 =
covid_df %>%
group_by(eventdatetime, admitted) %>%
summarize(count = n()) %>%
ggplot(aes(x = eventdatetime,
y = count,
color = admitted)) +
geom_point(aes(text = paste("Date: ", eventdatetime,
"\nNumber of Counts: ", count,
"\nAdmitted: ", admitted))) +
geom_line() +
labs(title = "",
x = "Event Date",
y = "Number of Events",
color = "Admitted")
ggplotly(plot_1, tooltip = "text")
The ethnic and racial background of pediatric patients with COVID-19 infection reflect the diverse population served in the Bronx. Latinos represent the majority of patients in whom positive SARS-CoV-2 RT-PCR test results were returned, followed by blacks, caucasians, and asians. Overall, latino and black children with COVID-19 infection seem to have lower hospitalization rates compared to caucasian and asian children.
plot_2 =
covid_df %>%
group_by(ethnicity_race, admitted) %>%
summarize(count = n()) %>%
filter(ethnicity_race != "american indian") %>%
filter(ethnicity_race != "multiple") %>%
ggplot(aes(x = fct_reorder(ethnicity_race, count),
y = count,
fill = admitted,
text = paste("Ethnicity: ", ethnicity_race,
"\nCounts: ", count,
"\nAdmitted: ", admitted))) +
geom_bar(stat="identity", position=position_dodge()) +
coord_flip() +
labs(title = "",
x = "Counts",
y = "Ethnicity",
fill = "Admitted")
ggplotly(plot_2, tooltip = "text")
The below interactive bar graph is another representation of our earlier exploratory ggplot showing the density or count of COVID-19 positivity as function of age. This information is further stratified by admission status (yes/no).
plot_3 =
covid_df %>%
mutate(age = round(age),
age = as.factor(age)) %>%
group_by(age, admitted) %>%
summarize(count = n()) %>%
ggplot(aes(x = age,
y = count,
fill = admitted,
text = paste("Age: ", age,
"\nCounts: ", count,
"\nAdmitted: ", admitted))) +
geom_bar(stat="identity", position=position_dodge()) +
labs(title = "",
x = "Age",
y = "Counts",
fill = "Admitted")
ggplotly(plot_3, tooltip = "text")
The scatterplot shows BMI values versus age by admission status with a smooth curve. In general, BMI values increase as age increases for both admitted and non-admitted patients. The smooth curves generally overlap but diverge beginning at around age 14, with a trend for more admissions for extremely obese patients with high outlier BMI values.
plot_4 =
covid_df %>%
ggplot(aes(x = age, y = bmi_value, color = admitted)) +
geom_point(aes(text = paste("Age: ", age,
"\nBMI: ", bmi_value,
"\nAdmitted: ", admitted))) +
geom_smooth(se = FALSE) +
labs(title = "",
x = "Age",
y = "BMI",
color = "Admitted")
ggplotly(plot_4, tooltip = "text")
The below interactive plot shows plots of BMI as a function of age, stratified by race. In general, BMI values seem to increase as age increases. There are outliers with some extreme BMI values, including some with a BMI of greater than 60.
plot_5 =
covid_df %>%
filter(ethnicity_race != "multiple") %>%
ggplot(aes(x = age, y = bmi_value, color = ethnicity_race)) +
geom_point(aes(text = paste("Age: ", age,
"\nBMI: ", bmi_value,
"\nEthnicity: ", ethnicity_race))) +
geom_smooth(se = FALSE) +
labs(title = "",
x = "Age",
y = "BMI",
color = "Ethnicity")
ggplotly(plot_5, tooltip = "text")
The below interactive box plot explores the relationship between BMI and gender. The median BMI is higher in female children with COVID-19 infection compared to male children. There are some high BMI outliers, particularly among male children, with one BMI of 80.84.
plot_6 =
covid_df %>%
filter(gender != "U") %>%
ggplot(aes(x = gender, y = bmi_value, fill = gender)) +
geom_boxplot(draw_quantiles = c(0.25, 0.5, 0.75)) +
stat_summary(fun = "mean", color = "blue") +
labs(
title = "",
x = "Gender",
y = "BMI",
caption = ""
) +
theme(legend.position="none")
ggplotly(plot_6)
This interactive scatterplot explores the relationship between SES and BMI in this dataset. In general, SES and BMI values are clustered at lower values of BMI with a spread of SES that seems similar for males and females between 0 and -10. Briefly, the SES variable is a score generated from census-block groups with increasing score signifying an increasing neighborhood socioeconomic advantage. BMI values appears to be unevenly distributed with clustering of values below 30 and clustering of values greater than 30. Again, we observe the high outlier values of BMI for males and these appear to be distributed in the lower half of SES values.
plot_7 =
covid_df %>%
filter(gender != "U") %>%
ggplot(aes(x = bmi_value, y = ses, color = gender)) +
geom_point(aes(text = paste("SES: ", ses,
"\nBMI: ", bmi_value,
"\nGender: ", gender))) +
labs(title = "",
x = "BMI",
y = "SES",
color = "Gender")
ggplotly(plot_7, tooltip = "text")
The below scatterplot further explores BMI and age in a faceted plot by gender. Hovering over each observation will reveal the specific age, BMI, gender, as well as SES. Futhermore, the area of each circle corresponds to the SES of that particular patient.
plot_8 =
covid_df %>%
filter(gender != "U") %>%
ggplot(aes(x = age,
y = bmi_value,
size = ses,
color = gender,
text = paste("Age: ", age,
"\nSES: ", ses,
"\nBMI: ", bmi_value,
"\nGender: ", gender))) +
geom_point(alpha=0.4) +
scale_size(range = c(1, 8)) +
theme(legend.position="none") +
facet_grid(.~gender) +
labs(title = "",
x = "Age",
y = "BMI",
color = "Gender")
ggplotly(plot_8, tooltip = "text")
The below interactive bar graph explores the relationship between average age of COVID-19 infected children and geographic location. The average age of pediatric patients was 14.5 years in the Bronx, which returned the greatest positivity count. Staten Island had the oldest average age (22 years) and Berkeley Heights had the youngest average age (5 years), but there was only one patient from each of these respective locations.
plot_9 =
covid_df %>%
group_by(city, county) %>%
summarize(mean_age = mean(age, na.rm = TRUE),
count = n()) %>%
drop_na(city) %>%
ggplot(aes(x = fct_reorder(city, mean_age),
y = mean_age,
fill = count,
text = paste("City: ", city,
"\nAvg Age: ", mean_age,
"\nCounts: ", count))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(title = "",
x = "City",
y = "Average Age",
fill = "Count")
ggplotly(plot_9, tooltip = "text")
This heat map explores the density of BMI and SES, suggesting two regions of greatest density for these variables. One “hot” patch seems to be defined by SES values ranging from -6.5 to -7.5 and BMI values from 20.5 to 24. The second hot patch consists of higher SES values, with a range of -1.3 to -2.9 and BMI values ranging from 22 to 25.
plot_10 =
covid_df %>%
ggplot(aes(x = ses, y = bmi_value)) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
ylim(10, 35) +
theme(legend.position = "right") +
labs(title = "",
x = "SES",
y = "BMI",
fill = "Density")
ggplotly(plot_10)
Our next goal was to build an interactive predictive tool allowing users to see the effects of changing certain predictors on risk of hospitalization. To do this we examined several predictive models.
Based on our exploratory analysis as well as a priori knowledge, we incorporated the following variables into our predictive model for hospitalization:
Our goal in the following analyses was to examine the area under the operating characteristic curve (AUC) for various models. We used the caret package in R, which allows easy implentation and tuning of various different machine learning algorithms.
First we loaded in the data and divided the data into training and testing sets using a 2:1 ratio.
# Load in the data and divide into training and testing sets
library(caret)
library(tidyverse)
library(ROCR)
datacomplete = read_csv("./data/datacomplete.csv") %>%
mutate_at(c("admitted", "ethnicity_race", "asthma", "diabetes"), as.factor) %>%
select(admitted, age, bmi_value, systolic_bp_value, ethnicity_race, asthma, diabetes)
train_data =
datacomplete %>%
slice(1:250)
test_data =
datacomplete %>%
slice(251:375)
First we used a random forest model, which is an ensemble learning method for classification or regression which constructs a multitude of decision trees using a training set and outputs the class that is the mode of the classes of the individual trees.
myFolds = createFolds(train_data$admitted, k = 5)
rfControl = trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
set.seed(123)
rfmodel = caret::train(
admitted ~.,
data = train_data,
metric = "ROC",
method = "ranger",
trControl = rfControl
)
+ Fold1: mtry= 2, min.node.size=1, splitrule=gini
- Fold1: mtry= 2, min.node.size=1, splitrule=gini
+ Fold1: mtry= 6, min.node.size=1, splitrule=gini
- Fold1: mtry= 6, min.node.size=1, splitrule=gini
+ Fold1: mtry=10, min.node.size=1, splitrule=gini
- Fold1: mtry=10, min.node.size=1, splitrule=gini
+ Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
- Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
+ Fold1: mtry= 6, min.node.size=1, splitrule=extratrees
- Fold1: mtry= 6, min.node.size=1, splitrule=extratrees
+ Fold1: mtry=10, min.node.size=1, splitrule=extratrees
- Fold1: mtry=10, min.node.size=1, splitrule=extratrees
+ Fold2: mtry= 2, min.node.size=1, splitrule=gini
- Fold2: mtry= 2, min.node.size=1, splitrule=gini
+ Fold2: mtry= 6, min.node.size=1, splitrule=gini
- Fold2: mtry= 6, min.node.size=1, splitrule=gini
+ Fold2: mtry=10, min.node.size=1, splitrule=gini
- Fold2: mtry=10, min.node.size=1, splitrule=gini
+ Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
- Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
+ Fold2: mtry= 6, min.node.size=1, splitrule=extratrees
- Fold2: mtry= 6, min.node.size=1, splitrule=extratrees
+ Fold2: mtry=10, min.node.size=1, splitrule=extratrees
- Fold2: mtry=10, min.node.size=1, splitrule=extratrees
+ Fold3: mtry= 2, min.node.size=1, splitrule=gini
- Fold3: mtry= 2, min.node.size=1, splitrule=gini
+ Fold3: mtry= 6, min.node.size=1, splitrule=gini
- Fold3: mtry= 6, min.node.size=1, splitrule=gini
+ Fold3: mtry=10, min.node.size=1, splitrule=gini
- Fold3: mtry=10, min.node.size=1, splitrule=gini
+ Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
- Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
+ Fold3: mtry= 6, min.node.size=1, splitrule=extratrees
- Fold3: mtry= 6, min.node.size=1, splitrule=extratrees
+ Fold3: mtry=10, min.node.size=1, splitrule=extratrees
- Fold3: mtry=10, min.node.size=1, splitrule=extratrees
+ Fold4: mtry= 2, min.node.size=1, splitrule=gini
- Fold4: mtry= 2, min.node.size=1, splitrule=gini
+ Fold4: mtry= 6, min.node.size=1, splitrule=gini
- Fold4: mtry= 6, min.node.size=1, splitrule=gini
+ Fold4: mtry=10, min.node.size=1, splitrule=gini
- Fold4: mtry=10, min.node.size=1, splitrule=gini
+ Fold4: mtry= 2, min.node.size=1, splitrule=extratrees
- Fold4: mtry= 2, min.node.size=1, splitrule=extratrees
+ Fold4: mtry= 6, min.node.size=1, splitrule=extratrees
- Fold4: mtry= 6, min.node.size=1, splitrule=extratrees
+ Fold4: mtry=10, min.node.size=1, splitrule=extratrees
- Fold4: mtry=10, min.node.size=1, splitrule=extratrees
+ Fold5: mtry= 2, min.node.size=1, splitrule=gini
- Fold5: mtry= 2, min.node.size=1, splitrule=gini
+ Fold5: mtry= 6, min.node.size=1, splitrule=gini
- Fold5: mtry= 6, min.node.size=1, splitrule=gini
+ Fold5: mtry=10, min.node.size=1, splitrule=gini
- Fold5: mtry=10, min.node.size=1, splitrule=gini
+ Fold5: mtry= 2, min.node.size=1, splitrule=extratrees
- Fold5: mtry= 2, min.node.size=1, splitrule=extratrees
+ Fold5: mtry= 6, min.node.size=1, splitrule=extratrees
- Fold5: mtry= 6, min.node.size=1, splitrule=extratrees
+ Fold5: mtry=10, min.node.size=1, splitrule=extratrees
- Fold5: mtry=10, min.node.size=1, splitrule=extratrees
Aggregating results
Selecting tuning parameters
Fitting mtry = 2, splitrule = gini, min.node.size = 1 on full training set
pred_rf = predict(rfmodel,test_data,type="prob")
pred_rf = pred_rf$yes
auc_rf = as.numeric(ROCR::performance(prediction(pred_rf,test_data$admitted),"auc")@y.values)
auc_rf
[1] 0.5561543
Our AUC with the random forest model is 0.5561543.
The XGBoost model is a powerful ensemble machine learning algorithm that implements gradient boosted decision trees.
xgbTrainingControl = trainControl(method = "CV",
number = 5,
savePredictions = TRUE,
classProbs = TRUE,
summaryFunction = twoClassSummary,
verboseIter = F)
nrounds = 1000
xgb_grid = expand.grid(
nrounds = seq(from = 200, to = nrounds, by = 50),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3, 4, 5, 6),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
set.seed(123)
fit_xgb = caret::train(admitted~.,
data = train_data,
metric = "ROC",
method = "xgbTree",
tuneGrid = xgb_grid,
trControl = xgbTrainingControl)
pred_xgb = predict(fit_xgb,test_data,type="prob")
pred_xgb = pred_xgb$yes
auc_xgb = as.numeric(ROCR::performance(ROCR::prediction(pred_xgb,test_data$admitted),"auc")@y.values)
auc_xgb
[1] 0.6304594
The AUC generated by the XGB model is 0.6304594.
Next we explored the use of a generalized linear model through Glmnet, which fits a generalized linear model via penalized maximum likelihood.
glmnetControl = trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE
)
set.seed(123)
glmnetmodel = caret::train(
admitted ~. ,
data=train_data,
tuneGrid = expand.grid(
alpha=0:1,
lambda=seq(0.0001,1, length=20)
),
method = "glmnet",
trControl = glmnetControl,
preProcess= c("center","scale")
)
+ Fold01: alpha=0, lambda=1
- Fold01: alpha=0, lambda=1
+ Fold01: alpha=1, lambda=1
- Fold01: alpha=1, lambda=1
+ Fold02: alpha=0, lambda=1
- Fold02: alpha=0, lambda=1
+ Fold02: alpha=1, lambda=1
- Fold02: alpha=1, lambda=1
+ Fold03: alpha=0, lambda=1
- Fold03: alpha=0, lambda=1
+ Fold03: alpha=1, lambda=1
- Fold03: alpha=1, lambda=1
+ Fold04: alpha=0, lambda=1
- Fold04: alpha=0, lambda=1
+ Fold04: alpha=1, lambda=1
- Fold04: alpha=1, lambda=1
+ Fold05: alpha=0, lambda=1
- Fold05: alpha=0, lambda=1
+ Fold05: alpha=1, lambda=1
- Fold05: alpha=1, lambda=1
+ Fold06: alpha=0, lambda=1
- Fold06: alpha=0, lambda=1
+ Fold06: alpha=1, lambda=1
- Fold06: alpha=1, lambda=1
+ Fold07: alpha=0, lambda=1
- Fold07: alpha=0, lambda=1
+ Fold07: alpha=1, lambda=1
- Fold07: alpha=1, lambda=1
+ Fold08: alpha=0, lambda=1
- Fold08: alpha=0, lambda=1
+ Fold08: alpha=1, lambda=1
- Fold08: alpha=1, lambda=1
+ Fold09: alpha=0, lambda=1
- Fold09: alpha=0, lambda=1
+ Fold09: alpha=1, lambda=1
- Fold09: alpha=1, lambda=1
+ Fold10: alpha=0, lambda=1
- Fold10: alpha=0, lambda=1
+ Fold10: alpha=1, lambda=1
- Fold10: alpha=1, lambda=1
Aggregating results
Selecting tuning parameters
Fitting alpha = 1, lambda = 0.0527 on full training set
pred_glmnet = predict(glmnetmodel,test_data,type="prob")
pred_glmnet = pred_glmnet$yes
auc_glmnet = as.numeric(ROCR::performance(prediction(pred_glmnet,test_data$admitted),"auc")@y.values)
auc_glmnet
[1] 0.6202496
The AUC of the glmnet model is 0.6202496.
Finally, we tried a partial least squares model, which is a linear regression method that can deal with large numbers of predictors, small sample size, and high collinearity among predictors. That is less the case in our trimmed down set of predictors, but nonetheless we decided to examine how it would perform in our dataset.
#SPLS
train_grid = expand.grid(K =c(3,4,5,6,7,8,9),eta =c(0.35,0.4,0.45,0.5,0.55,0.6,0.7,0.8,0.9),
kappa=0.5)
set.seed(123)
plsFit = caret::train(admitted~.,
data = train_data,
method = "spls",
tuneGrid = train_grid,
trControl = xgbTrainingControl,
metric = "ROC")
pred_plsda = predict(plsFit,test_data,type="prob")
pred_plsda = pred_plsda$yes
auc_plsda = as.numeric(ROCR::performance(prediction(pred_plsda,test_data$admitted),"auc")@y.values)
auc_plsda
The AUC for the SPLS model is 0.6101815.
Overall, all the AUCs are much lower than what would be considered useful in clinical practice. However, given that the goal of our analyses is mainly exploratory at this point, we then built an interactive predictive tool allowing users to see the effects of changing certain predictors on risk of hospitalization. Results should clearly be interpreted with caution and the models will benefit from the addition of other predictors which can be added on as this dataset grows. We ultimately chose to use the random forest model, given similar AUCs between each of the models and the relatively quick speed at which the random forest model generates its predictions.
Code for the shiny app can be found at Lusha’s github.
We performed an exploratory analysis of a dataset of 375 pediatric patients who tested positive for COVID-19 and assessed predictors for hospitalization. We found age, obesity, and diabetes were associated with the need for hospitalization in multivariable logistic regression modeling. Gender, ethnicity, and asthma were not found to be significantly associated with the need for hospitalization. There are a number of important limitations in this dataset that may affect the validity of the study results. First, the availability of testing was limited in the early phases of the COVID-19 pandemic. This decreased its diagnostic utility, with SARS-CoV-2 RT PCR testing done in a confirmatory manner sometimes after the decision to admit was already made. This differential testing may affect the validity of study results. The sample size of 375 from a single tertiary care medical center is also limited, reflecting the relative overall decreased severity of illness of COVID-19 in pediatric patients compared to adults. Asthma was coded in a simplistic manner that does not reflect the complexity of this comorbidity. For example, asthma can vary in severity from, for example, mild intermittent to status asthmaticus. Our coding of asthma does not capture the gradient of disease severity. Obesity was also coded as a simple binary variable with a BMI cut-point of 30. In reality, obesity in children is defined by the 95th percentile of BMI, which varies by age.
Finally, the clinical decision to admit a pediatric patient with confirmed or suspected COVID-19 infection is a complex one. This dataset contained limited information about vital signs such as respiratory rate or oxygen saturation, which reflect acute respiratory distress. While past medical history is important, acute changes in a child’s pulmonary status are more clinically important in the decision to admit or not admit. These data were not available in this dataset. Notwithstanding these limitations, the exploratory data analyses in this project set the foundation for future research, which should take into account additional clinical data.